沟通压缩是现代分布式学习系统的至关重要技术,可以减轻其在较慢的网络上的交流瓶颈。尽管最近对数据并行式训练的梯度压缩进行了深入的研究,但压缩了通过管道并行性训练的模型的激活仍然是一个空旷的问题。在本文中,我们提出了AC-SGD,这是一种新型的激活压缩算法,用于在慢速网络上进行通信有效的管道并行性训练。 AC-SGD与以前的激活压缩方面的努力不同,而不是直接压缩激活值,而是压缩激活的变化。这使我们能够首次向我们的知识表明,仍然可以实现$ o(1/\ sqrt {t})$收敛速率,即激活压缩的非convex目标,而无需对梯度做出假设无偏见对于具有非线性激活功能的深度学习模型不符合。然后,我们证明AC-SGD可以有效地优化和实施,而无需额外的端到端运行时开销。我们将AC-SGD评估为微调语言具有高达15亿个参数的模型,将激活压缩至2-4位。AC-SGD在较慢的网络中可提供高达4.3倍的端到端速度,而无需牺牲模型质量。此外,我们还表明,AC-SGD可以与最先进的梯度压缩算法结合使用,以启用“端到端通信压缩:机器之间的所有通信,包括模型梯度,远期激活和后退梯度压缩为较低的精度。这提供了高达4.9倍的端到端加速,而无需牺牲模型质量。
translated by 谷歌翻译
训练基金会模型(例如GPT-3和Palm)可能非常昂贵,通常涉及数以万计的GPU连续运行数月。这些模型通常经过专门的群集培训,这些群集具有快速,均匀的互连,并使用精心设计的软件系统来支持数据并行性和模型/管道并行性。这样的专用集群可能是昂贵且难以获得的。我们可以相反,可以利用更大量的分散,异质和较低的互连计算?先前的工作研究了可以纯粹以数据并行方式训练的相对较小模型的异质,分散的设置重点。模型平行基础模型培训(例如威震天)的最先进的方案仅考虑均匀的数据中心设置。在本文中,我们介绍了第一个研究大型基础模型的研究,该模型在异质网络上的去中心化制度中进行了模型并行性。我们的主要技术贡献是一种调度算法,该算法将不同的计算“任务”在培训基础模型中分配给通过缓慢的异质网络连接的一组分散的GPU设备。我们提供了正式的成本模型,并进一步提出了一种有效的进化算法,以找到最佳分配策略。我们进行了广泛的实验,这些实验代表了使用现实世界网络测量模拟的地理分布设备进行学习的不同方案。在最极端的情况下,在跨越3大洲的8个不同的城市中,我们的方法比以前的最新培训系统(Megatron)快4.8倍。
translated by 谷歌翻译
过度分辨的神经网络概括井,但训练昂贵。理想情况下,人们希望减少其计算成本,同时保留其概括的益处。稀疏的模型培训是实现这一目标的简单和有希望的方法,但随着现有方法与准确性损失,慢速训练运行时的困难或困难,仍然存在挑战,仍然存在困难的挑战。核心问题是,在离散的一组稀疏矩阵上搜索稀疏性掩模是困难和昂贵的。为了解决此问题,我们的主要见解是通过具有称为蝴蝶矩阵产品的固定结构的固定结构来优化优化稀疏矩阵的连续超集。随着蝴蝶矩阵不是硬件效率,我们提出了简单的蝴蝶(块和平坦)的变体来利用现代硬件。我们的方法(像素化蝴蝶)使用基于扁平块蝴蝶和低秩矩阵的简单固定稀疏模式,以缩小大多数网络层(例如,注意,MLP)。我们经验验证了像素化蝴蝶比蝴蝶快3倍,加快培训,以实现有利的准确性效率权衡。在ImageNet分类和Wikitext-103语言建模任务中,我们的稀疏模型训练比致密的MLP - 混频器,视觉变压器和GPT-2媒体更快地训练高达2.5倍,没有精确下降。
translated by 谷歌翻译
Deep learning models can achieve high accuracy when trained on large amounts of labeled data. However, real-world scenarios often involve several challenges: Training data may become available in installments, may originate from multiple different domains, and may not contain labels for training. Certain settings, for instance medical applications, often involve further restrictions that prohibit retention of previously seen data due to privacy regulations. In this work, to address such challenges, we study unsupervised segmentation in continual learning scenarios that involve domain shift. To that end, we introduce GarDA (Generative Appearance Replay for continual Domain Adaptation), a generative-replay based approach that can adapt a segmentation model sequentially to new domains with unlabeled data. In contrast to single-step unsupervised domain adaptation (UDA), continual adaptation to a sequence of domains enables leveraging and consolidation of information from multiple domains. Unlike previous approaches in incremental UDA, our method does not require access to previously seen data, making it applicable in many practical scenarios. We evaluate GarDA on two datasets with different organs and modalities, where it substantially outperforms existing techniques.
translated by 谷歌翻译
The development of social media user stance detection and bot detection methods rely heavily on large-scale and high-quality benchmarks. However, in addition to low annotation quality, existing benchmarks generally have incomplete user relationships, suppressing graph-based account detection research. To address these issues, we propose a Multi-Relational Graph-Based Twitter Account Detection Benchmark (MGTAB), the first standardized graph-based benchmark for account detection. To our knowledge, MGTAB was built based on the largest original data in the field, with over 1.55 million users and 130 million tweets. MGTAB contains 10,199 expert-annotated users and 7 types of relationships, ensuring high-quality annotation and diversified relations. In MGTAB, we extracted the 20 user property features with the greatest information gain and user tweet features as the user features. In addition, we performed a thorough evaluation of MGTAB and other public datasets. Our experiments found that graph-based approaches are generally more effective than feature-based approaches and perform better when introducing multiple relations. By analyzing experiment results, we identify effective approaches for account detection and provide potential future research directions in this field. Our benchmark and standardized evaluation procedures are freely available at: https://github.com/GraphDetec/MGTAB.
translated by 谷歌翻译
As one of the prevalent methods to achieve automation systems, Imitation Learning (IL) presents a promising performance in a wide range of domains. However, despite the considerable improvement in policy performance, the corresponding research on the explainability of IL models is still limited. Inspired by the recent approaches in explainable artificial intelligence methods, we proposed a model-agnostic explaining framework for IL models called R2RISE. R2RISE aims to explain the overall policy performance with respect to the frames in demonstrations. It iteratively retrains the black-box IL model from the randomized masked demonstrations and uses the conventional evaluation outcome environment returns as the coefficient to build an importance map. We also conducted experiments to investigate three major questions concerning frames' importance equality, the effectiveness of the importance map, and connections between importance maps from different IL models. The result shows that R2RISE successfully distinguishes important frames from the demonstrations.
translated by 谷歌翻译
Compressed videos often exhibit visually annoying artifacts, known as Perceivable Encoding Artifacts (PEAs), which dramatically degrade video visual quality. Subjective and objective measures capable of identifying and quantifying various types of PEAs are critical in improving visual quality. In this paper, we investigate the influence of four spatial PEAs (i.e. blurring, blocking, bleeding, and ringing) and two temporal PEAs (i.e. flickering and floating) on video quality. For spatial artifacts, we propose a visual saliency model with a low computational cost and higher consistency with human visual perception. In terms of temporal artifacts, self-attention based TimeSFormer is improved to detect temporal artifacts. Based on the six types of PEAs, a quality metric called Saliency-Aware Spatio-Temporal Artifacts Measurement (SSTAM) is proposed. Experimental results demonstrate that the proposed method outperforms state-of-the-art metrics. We believe that SSTAM will be beneficial for optimizing video coding techniques.
translated by 谷歌翻译
We propose a distributionally robust return-risk model for Markov decision processes (MDPs) under risk and reward ambiguity. The proposed model optimizes the weighted average of mean and percentile performances, and it covers the distributionally robust MDPs and the distributionally robust chance-constrained MDPs (both under reward ambiguity) as special cases. By considering that the unknown reward distribution lies in a Wasserstein ambiguity set, we derive the tractable reformulation for our model. In particular, we show that that the return-risk model can also account for risk from uncertain transition kernel when one only seeks deterministic policies, and that a distributionally robust MDP under the percentile criterion can be reformulated as its nominal counterpart at an adjusted risk level. A scalable first-order algorithm is designed to solve large-scale problems, and we demonstrate the advantages of our proposed model and algorithm through numerical experiments.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
As one of the most important psychic stress reactions, micro-expressions (MEs), are spontaneous and transient facial expressions that can reveal the genuine emotions of human beings. Thus, recognizing MEs (MER) automatically is becoming increasingly crucial in the field of affective computing, and provides essential technical support in lie detection, psychological analysis and other areas. However, the lack of abundant ME data seriously restricts the development of cutting-edge data-driven MER models. Despite the recent efforts of several spontaneous ME datasets to alleviate this problem, it is still a tiny amount of work. To solve the problem of ME data hunger, we construct a dynamic spontaneous ME dataset with the largest current ME data scale, called DFME (Dynamic Facial Micro-expressions), which includes 7,526 well-labeled ME videos induced by 671 participants and annotated by more than 20 annotators throughout three years. Afterwards, we adopt four classical spatiotemporal feature learning models on DFME to perform MER experiments to objectively verify the validity of DFME dataset. In addition, we explore different solutions to the class imbalance and key-frame sequence sampling problems in dynamic MER respectively on DFME, so as to provide a valuable reference for future research. The comprehensive experimental results show that our DFME dataset can facilitate the research of automatic MER, and provide a new benchmark for MER. DFME will be published via https://mea-lab-421.github.io.
translated by 谷歌翻译